15 research outputs found

    Polly's Polyhedral Scheduling in the Presence of Reductions

    Full text link
    The polyhedral model provides a powerful mathematical abstraction to enable effective optimization of loop nests with respect to a given optimization goal, e.g., exploiting parallelism. Unexploited reduction properties are a frequent reason for polyhedral optimizers to assume parallelism-prohibiting dependences. To our knowledge, no polyhedral loop optimizer available in any production compiler provides support for reductions. In this paper, we show that leveraging the parallelism of reductions can lead to a significant performance increase. We give a precise, dependence-based definition of reductions and discuss ways to extend polyhedral optimization to exploit the associativity and commutativity of reduction computations. We have implemented a reduction-enabled scheduling approach in the Polly polyhedral optimizer and evaluate it on the standard Polybench 3.2 benchmark suite. We were able to detect and model all 52 arithmetic reductions and achieve speedups up to 2.21× on a quad-core machine by exploiting the multidimensional reduction in the BiCG benchmark. Comment: Presented at the IMPACT15 workshop.
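
    As an illustration (a hand-written sketch, not the paper's transformation, which Polly performs automatically on LLVM IR), the BiCG reduction the abstract mentions looks as follows; the accumulation into s[j] crosses iterations of the outer loop and is only parallelizable once recognized as an associative, commutative reduction:

        /* Illustrative sketch, not code from the paper: the BiCG kernel's
         * multidimensional reduction. The update of s[j] accumulates over
         * the outer i loop, so a reduction-unaware dependence analysis must
         * serialize it; treating it as an associative, commutative reduction
         * legalizes a parallel schedule, written here by hand with OpenMP. */
        void bicg(int n, double A[n][n], double r[n], double p[n],
                  double s[n], double q[n]) {
            for (int j = 0; j < n; j++)
                s[j] = 0.0;
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                q[i] = 0.0;
                for (int j = 0; j < n; j++) {
                    #pragma omp atomic
                    s[j] += r[i] * A[i][j];  /* reduction across i iterations */
                    q[i] += A[i][j] * p[j];  /* reduction private to row i */
                }
            }
        }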

    Architecture-parametric timing analysis

    No full text
    Platforms are families of microarchitectures that implement the same instruction set architecture but that differ in architectural parameters, such as frequency, memory latencies, or memory sizes. The choice of these parameters influences execution time, implementation cost, and energy consumption. In this paper, we introduce the first general framework for architecture-parametric timing analysis (APTA). APTA computes an expression that bounds the worst-case execution time (WCET) of a program in terms of architectural parameters. This makes it possible to configure a platform, at design or even at run time, in a way that is guaranteed to meet all deadlines, while minimizing implementation cost and/or energy consumption. We demonstrate the feasibility of our approach by implementing APTA for a precision-timed (PRET) platform and by evaluating our implementation on the Mälardalen benchmarks.
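
    To make the idea of a parametric bound concrete, here is a hypothetical sketch (the parameter names and the shape of the expression are ours, not the paper's) of how a WCET expression produced once by the analysis could be re-evaluated cheaply for each candidate platform configuration:

        /* Hypothetical sketch: a parametric WCET bound of the form
         *   WCET(p) = compute_cycles * cycle_time + mem_accesses * mem_latency.
         * The coefficients would be produced once by the analysis; the bound
         * can then be re-evaluated per configuration at design or run time. */
        #include <stdio.h>

        typedef struct {
            double cycle_time_ns;   /* 1 / frequency */
            double mem_latency_ns;  /* main-memory access latency */
        } platform_params;

        typedef struct {
            double compute_cycles;  /* worst-case non-memory cycles */
            double mem_accesses;    /* worst-case memory accesses */
        } wcet_expr;

        double wcet_bound_ns(const wcet_expr *e, const platform_params *p) {
            return e->compute_cycles * p->cycle_time_ns
                 + e->mem_accesses   * p->mem_latency_ns;
        }

        int main(void) {
            wcet_expr task = { .compute_cycles = 120000, .mem_accesses = 3400 };
            platform_params slow = { .cycle_time_ns = 2.0, .mem_latency_ns = 80.0 };
            platform_params fast = { .cycle_time_ns = 1.0, .mem_latency_ns = 80.0 };
            printf("bound slow: %.0f ns\n", wcet_bound_ns(&task, &slow));
            printf("bound fast: %.0f ns\n", wcet_bound_ns(&task, &fast));
            return 0;
        }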

    Input space splitting for OpenCL

    No full text

    GPU First -- Execution of Legacy CPU Codes on GPUs

    Full text link
    Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it challenging for non-experts to leverage GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are either resolved through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. Our evaluation on two HPC proxy applications with OpenMP CPU and GPU parallelism, four microbenchmarks with originally GPU-only parallelism, as well as three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of corresponding manually offloaded kernels, with up to 14.36× speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.
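
    Conceptually (this sketch is ours and simplifies the paper's actual mechanism), GPU First can be pictured as wrapping the unmodified application entry point in an offloading region, so that existing OpenMP CPU parallelism maps onto GPU threads and library calls such as printf are served by the partial device libc or forwarded to the host:

        #include <stdio.h>

        /* Unmodified legacy CPU code: */
        int legacy_main(void) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+ : sum)
            for (int i = 0; i < 1000000; i++)
                sum += 1.0 / (1.0 + i);
            printf("sum = %f\n", sum);  /* partial device libc or host RPC */
            return 0;
        }

        /* Compiler-generated wrapper, shown here as source for clarity: */
        int main(void) {
            int ret;
            #pragma omp target map(from : ret)
            ret = legacy_main();
            return ret;
        }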

    Runtime pointer disambiguation

    No full text
    In order to optimize code effectively, compilers must deal with memory dependences. However, the state-of-the-art heuristics available in the literature to track memory dependences are inherently imprecise and computationally expensive. Consequently, the most advanced code transformations that compilers have today are ineffective when applied to real-world programs. The goal of this paper is to solve this conundrum through the hybrid disambiguation of pointers. We provide a static analysis that generates dynamic tests to determine when two memory locations can overlap. We then produce two versions of a loop: one that is aliasing-free, hence easy to optimize, and another that is not. Our checks let us safely branch to the optimizable region. We have applied these ideas to polly-llvm, a loop optimizer built on top of the LLVM compilation infrastructure. Our experiments indicate that our method is precise, effective, and useful: we can disambiguate every pair of pointers in the loop-intensive Polybench benchmark suite. The result of this precision is code quality: the binaries that we generate are 10% faster than those that polly-llvm produces without our optimization, at the -O3 optimization level of LLVM. Given the current technology to statically solve alias analysis, we believe that our ideas are a necessary step to make modern compiler optimizations useful in practice.
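
    A minimal sketch of the hybrid scheme (illustrative; the paper's tool emits the equivalent at the IR level, and the check shown is one simple form such a generated test could take): the statically derived dynamic test checks whether the two accessed ranges can overlap and branches to an aliasing-free loop version when they cannot:

        /* Illustrative two-version loop with a runtime disambiguation check. */
        #include <stddef.h>
        #include <stdint.h>

        void saxpy(float *a, const float *b, float s, size_t n) {
            uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
            size_t bytes = n * sizeof(float);
            /* Dynamic test generated from the static analysis:
             * do the ranges [a, a+n) and [b, b+n) overlap? */
            if (pa + bytes <= pb || pb + bytes <= pa) {
                /* Aliasing-free version: the compiler may vectorize freely. */
                float *restrict ar = a;
                const float *restrict br = b;
                for (size_t i = 0; i < n; i++)
                    ar[i] += s * br[i];
            } else {
                /* Possible overlap: keep the conservative version. */
                for (size_t i = 0; i < n; i++)
                    a[i] += s * b[i];
            }
        }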